The Seattle crime dataset for 2014 contains more than 2000 incidents with missing coordinate data (latitude and longitude). To investigate whether this missingness is systematically related to other variables in the dataset, I constructed several interactive plots.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
sanfrancisco <- read.csv("sanfrancisco_incidents_summer_2014.csv")
seattle <- read.csv("seattle_incidents_summer_2014.csv")
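The code further down tests `Latitude == 0`, which suggests that missing coordinates are stored as zeros rather than as `NA`. As a minimal sketch of how such rows can be counted, the snippet below uses a small invented data frame (`toy` is a hypothetical stand-in for `seattle`, and its values are made up):

```r
# Toy stand-in for the `seattle` data frame; the coordinates are invented.
toy <- data.frame(
  Latitude  = c(47.61, 0, 47.55, 0, 47.70),
  Longitude = c(-122.33, 0, -122.30, 0, -122.35)
)

# Flag rows where either coordinate holds the placeholder value zero.
missing_position <- toy$Latitude == 0 | toy$Longitude == 0

n_missing     <- sum(missing_position)   # number of rows without coordinates
share_missing <- mean(missing_position)  # fraction of rows without coordinates

# Check whether latitude and longitude are always zero together, which
# would justify testing only one of the two columns.
zeros_coincide <- all((toy$Latitude == 0) == (toy$Longitude == 0))
```

Replacing `toy` with `seattle` should give the corresponding counts for the real data, assuming the column names match.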
#To avoid repeating code when generating the various conditional distributions, I
#constructed a function.
calculate_conditional_distribution <- function(column) {
  column <- enquo(column)
  zero_assess <- seattle %>%
    mutate(Position_Data_Missing = Latitude == 0) %>%
    group_by(Position_Data_Missing, !!column) %>%
    summarize(m = n()) %>%
    group_by(!!column) %>%
    mutate(n = sum(m)) %>%
    ungroup() %>%
    mutate(sum_total = sum(m)) %>%
    group_by(Position_Data_Missing) %>%
    mutate(sum_zero_test = sum(m)) %>%
    mutate(total_dist = n / sum_total) %>%
    mutate(conditional_distribution = m / sum_zero_test) %>%
    ungroup() %>%
    group_by(!!column)
  #initialize list of items to be returned (a d.f. and a ggplot)
  return_list <- list()
  return_list[["df"]] <- zero_assess
  #create visualizations for assessment of independence
  plot <- zero_assess %>%
    filter(n > 10) %>%
    mutate(Position_Data_Missing =
             ifelse(Position_Data_Missing == TRUE,
                    "missing", "not missing")) %>%
    ggplot() +
    geom_col(aes_(x = column, y = ~conditional_distribution,
                  fill = ~Position_Data_Missing)) +
    labs(x = str_c(quo_name(column),
                   "\n(move cursor over bars to see the categories)"),
         y = "conditional distribution") +
    ggtitle("Assessing the Effect of Missingness of Position Data") +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.text.x = element_blank(),
          legend.title = element_blank()) +
    guides(fill = guide_legend(reverse=TRUE))
  return_list[["ggplot"]] <- plot
  interactive_plot <-
    ggplotly(plot) %>%
    layout(margin = list(b = 50, l = 60, r = 10, t = 80))
  return_list[["plotly"]] <- interactive_plot
  return(return_list)
}
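The `conditional_distribution` column computed in the function is `m / sum_zero_test`, that is, the share of each category within the missing group and within the non-missing group. As a rough cross-check of that quantity, the same proportions can be obtained in base R with `table()` and `prop.table()`. The data frame below is invented for illustration and is not taken from the Seattle file:

```r
# Toy stand-in for the `seattle` data frame; all values are invented.
toy <- data.frame(
  Latitude = c(0, 0, 47.6, 47.6, 47.6, 47.6),
  offense  = c("THEFT", "ASSAULT", "THEFT", "THEFT", "ASSAULT", "DUI")
)

# Cross-tabulate missingness (rows) against the category (columns).
counts <- table(Position_Data_Missing = toy$Latitude == 0,
                offense = toy$offense)

# margin = 1 divides each row by its row total, giving the distribution
# of the category conditional on missingness.
cond <- prop.table(counts, margin = 1)
cond
```

Each row of `cond` sums to one. If missingness were unrelated to the category, the two rows would be roughly equal, which is what the stacked bars are meant to show.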
We may thus find the distributions of certain general categories of crime, specified by a “summarized offense description”, conditional on missingness and on non-missingness of the coordinate data.
results <- calculate_conditional_distribution(Summarized.Offense.Description)
results[["plotly"]]
Observing that the heights of the red and green portions are roughly the same in most cases, we find little evidence of a systematic relationship between missingness of coordinate data and the type of reported crime. However, moving the cursor over the bars, we see that prostitution is a notable exception, with apparently all of its cases recorded with coordinate data. Examination of the data frame generated by the above code shows that there were 202 cases of prostitution during the year (from 12/29/2014 through 1/2/2015), and that all of the recorded cases did indeed have coordinate data.
Are there other types of crime for which this is the case? Let’s see.
#A simple manipulation of the data derived in calculate_conditional_distribution suffices.
results <- calculate_conditional_distribution(Summarized.Offense.Description)
results[["df"]] %>% filter(m == n)
## # A tibble: 9 x 8
## # Groups: Summarized.Offense.Description [9]
## Position_Data_Mi… Summarized.Offens… m n sum_total sum_zero_test
## <lgl> <fct> <int> <int> <int> <int>
## 1 F [INC - CASE DC US… 5 5 32779 30729
## 2 F DISORDERLY CONDUCT 2 2 32779 30729
## 3 F DUI 34 34 32779 30729
## 4 F ELUDING 8 8 32779 30729
## 5 F ESCAPE 3 3 32779 30729
## 6 F HOMICIDE 8 8 32779 30729
## 7 F PORNOGRAPHY 3 3 32779 30729
## 8 F PROSTITUTION 202 202 32779 30729
## 9 F PUBLIC NUISANCE 4 4 32779 30729
## # ... with 2 more variables: total_dist <dbl>,
## # conditional_distribution <dbl>
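Note that `m == n` keeps any category that falls entirely into one missingness group, so it would also return categories whose coordinates are always missing. In the output above every row is `F`, so all nine categories are of the never-missing kind. A small base R sketch that separates the two cases explicitly, using invented data (`toy` and its values are hypothetical):

```r
# Toy stand-in for the `seattle` data frame; all values are invented.
toy <- data.frame(
  Latitude = c(47.6, 47.6, 0, 47.6, 0, 0),
  Summarized.Offense.Description =
    c("DUI", "DUI", "THEFT", "THEFT", "FRAUD", "FRAUD")
)

# Fixing the factor levels guarantees that both the FALSE and the TRUE row
# exist in the table, even if one of the groups happens to be empty.
counts <- table(
  missing = factor(toy$Latitude == 0, levels = c(FALSE, TRUE)),
  offense = toy$Summarized.Offense.Description
)

# Categories with no missing rows, and categories with only missing rows.
never_missing  <- colnames(counts)[counts["TRUE", ] == 0]
always_missing <- colnames(counts)[counts["FALSE", ] == 0]
```

In the toy data, `DUI` is never missing and `FRAUD` is always missing, while `THEFT` appears in both groups and is excluded from both lists.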
Although a chi-squared test may be useful here, what we have so far suggests at least a weak systematic relationship between missingness and the type of crime reported in an instance. Perhaps prostitution was recorded somewhat more meticulously? We cannot infer that this is the case, but the data suggests that it might be.
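A chi-squared test of independence between missingness and offense type could be sketched as follows. This was not run on the Seattle data. The data frame is synthetic, and the Monte Carlo p-value (`simulate.p.value = TRUE`) is my own choice, made because several categories in the real table have very small counts, which makes the usual chi-squared approximation unreliable.

```r
# Synthetic stand-in for the `seattle` data frame; all values are invented.
set.seed(1)
n_rows <- 600
toy <- data.frame(
  Summarized.Offense.Description =
    sample(c("ASSAULT", "BURGLARY", "PROSTITUTION"), n_rows, replace = TRUE),
  Latitude = 47.6
)

# Mark roughly 10% of the non-prostitution rows as missing (Latitude == 0),
# mimicking a category that is always recorded with coordinates.
drop_position <- toy$Summarized.Offense.Description != "PROSTITUTION" &
  runif(n_rows) < 0.1
toy$Latitude[drop_position] <- 0

# Contingency table of missingness by offense type.
tab <- table(missing = toy$Latitude == 0,
             offense = toy$Summarized.Offense.Description)

# Test of independence; the p-value is simulated with B replicates
# instead of relying on the asymptotic chi-squared distribution.
test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
test$p.value
```

With roughly 33,000 rows in the real data, even small departures from independence are likely to yield a small p-value, so the result would be best read alongside the size of the differences visible in the plots.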
We now turn to the relationship between district sector and missingness.
results <- calculate_conditional_distribution(District.Sector)
results[["plotly"]]
The mostly-red bar at the left end corresponds to records for which no district sector is indicated, so it is not surprising that their coordinate data is also missing. Other than, perhaps, the mysterious District Sector 99 (with a total of only 40 reported instances), the data suggests that the recording of coordinates was consistent across the sectors.
Similar considerations apply to the possible relationship between missingness and Zone Beat.
results <- calculate_conditional_distribution(Zone.Beat)
results[["plotly"]]
In this case, minor differences in the conditional distributions are evident, but they are likely not significant. (I may conduct a chi-squared test of this assumption.) We may thus reasonably suppose that plotting the crime data on a map of the city would not distort the distribution of recorded crime across Seattle. Plotting all of the instances in the table that have position data (latitude and longitude), we obtain the following.
library(ggmap)
##
## Attaching package: 'ggmap'
## The following object is masked from 'package:plotly':
##
## wind
seattle_gg <- get_map("Seattle", maptype = "toner-lite",
                      zoom = 11)
## maptype = "toner-lite" is only available with source = "stamen".
## resetting to source = "stamen"...
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Seattle&zoom=11&size=640x640&scale=2&maptype=terrain&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Seattle&sensor=false
## Map from URL : http://tile.stamen.com/toner-lite/11/326/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/327/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/328/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/329/714.png
## Map from URL : http://tile.stamen.com/toner-lite/11/326/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/327/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/328/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/329/715.png
## Map from URL : http://tile.stamen.com/toner-lite/11/326/716.png
## Map from URL : http://tile.stamen.com/toner-lite/11/327/716.png
## Map from URL : http://tile.stamen.com/toner-lite/11/328/716.png
## Map from URL : http://tile.stamen.com/toner-lite/11/329/716.png
#Violent_Crimes <- c("ASSAULT", "HOMICIDE", "ROBBERY")
#remove zero latitude and longitude elements, for mapping
seattle_nonzero <- seattle %>%
  filter(Latitude != 0 & Longitude != 0)
#data <- seattle_nonzero %>%
# filter(is.element(Summarized.Offense.Description, Violent_Crimes))
#data <- seattle_nonzero %>%
# filter(Summarized.Offense.Description == "NARCOTICS")
ggmap(seattle_gg, darken = c(.01, "black")) +
  geom_bin2d(data = seattle_nonzero,
             aes(x = Longitude, y = Latitude),
             bins = 200) +
  scale_fill_gradient(trans = "log10")